{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/customizing_presidio_analyzer.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/customizing_presidio_analyzer.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Customizing the PII analysis process in Microsoft Presidio\n",
    "\n",
    "This notebook covers customization use cases that show how to:\n",
    "\n",
    "1. Adapt Presidio to detect new types of PII entities\n",
    "2. Adapt Presidio to detect PII entities in a new language\n",
    "3. Embed new types of detection modules into Presidio, to improve the coverage of the service."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "First, let's install Presidio using `pip`. For detailed documentation, see the [installation docs](https://microsoft.github.io/presidio/installation).\n",
    "\n",
    "Install from PyPI:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# download presidio\n",
    "!pip install presidio_analyzer presidio_anonymizer\n",
    "!python -m spacy download en_core_web_lg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting started\n",
    "\n",
    "The high-level process in Presidio Analyzer is as follows:\n",
    "![image.png](https://github.com/microsoft/presidio/raw/main/docs/assets/analyzer-design.png)\n",
    "\n",
    "Load the `presidio-analyzer` modules. For more information, see the [analyzer docs](https://microsoft.github.io/presidio/analyzer/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "import pprint\n",
    "\n",
    "from presidio_analyzer import (\n",
    "    AnalyzerEngine,\n",
    "    PatternRecognizer,\n",
    "    EntityRecognizer,\n",
    "    Pattern,\n",
    "    RecognizerResult,\n",
    ")\n",
    "from presidio_analyzer.recognizer_registry import RecognizerRegistry\n",
    "from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts\n",
    "from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper method to print results nicely\n",
    "\n",
    "\n",
    "def print_analyzer_results(results: List[RecognizerResult], text: str):\n",
    "    \"\"\"Print the results in a human readable way.\"\"\"\n",
    "\n",
    "    for i, result in enumerate(results):\n",
    "        print(f\"Result {i}:\")\n",
    "        print(f\" {result}, text: {text[result.start:result.end]}\")\n",
    "\n",
    "        if result.analysis_explanation is not None:\n",
    "            print(f\" {result.analysis_explanation.textual_explanation}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 1: Deny-list based PII recognition\n",
    "In this example, we will pass a short list of tokens which should be marked as PII if detected.\n",
    "First, let's define the tokens we want to treat as PII. In this case it would be a list of titles:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles_list = [\n",
    "    \"Sir\",\n",
    "    \"Ma'am\",\n",
    "    \"Madam\",\n",
    "    \"Mr.\",\n",
    "    \"Mrs.\",\n",
    "    \"Ms.\",\n",
    "    \"Miss\",\n",
    "    \"Dr.\",\n",
    "    \"Professor\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, let's create a `PatternRecognizer` which would scan for those titles, by passing a `deny_list`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles_recognizer = PatternRecognizer(supported_entity=\"TITLE\", deny_list=titles_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point we can call our recognizer directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result:\n",
      " [type: TITLE, start: 10, end: 19, score: 1.0]\n"
     ]
    }
   ],
   "source": [
    "text1 = \"I suspect Professor Plum, in the Dining Room, with the candlestick\"\n",
    "result = titles_recognizer.analyze(text1, entities=[\"TITLE\"])\n",
    "print(f\"Result:\\n {result}\")"
   ]
  },
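  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conceptually, a deny list behaves like a regex alternation over the escaped terms. The sketch below illustrates that idea with the standard library only. It is an assumption about the general approach, not Presidio's actual implementation, and the names `titles` and `deny_regex` are hypothetical:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical subset of the titles defined above\n",
    "titles = [\"Sir\", \"Mr.\", \"Professor\"]\n",
    "\n",
    "# Escape each term so that characters such as \".\" are matched literally\n",
    "alternation = \"|\".join(re.escape(title) for title in titles)\n",
    "\n",
    "# Only match terms that are not glued to other word characters\n",
    "deny_regex = re.compile(rf\"(?<!\\w)(?:{alternation})(?!\\w)\")\n",
    "\n",
    "sample = \"I suspect Professor Plum, in the Dining Room, with the candlestick\"\n",
    "match = deny_regex.search(sample)\n",
    "print(match.span(), match.group())\n",
    "```\n",
    "\n",
    "The printed span can be compared with the `start` and `end` values returned by `titles_recognizer` above."
   ]
  },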
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's add this new recognizer to the list of recognizers used by the Presidio `AnalyzerEngine`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [],
   "source": [
    "analyzer = AnalyzerEngine()\n",
    "analyzer.registry.add_recognizer(titles_recognizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When initializing the `AnalyzerEngine`, Presidio loads all available recognizers, as well as the `NlpEngine`, which is used to detect entities and to extract tokens, lemmas and other linguistic features.\n",
    "\n",
    "Let's run the analyzer with the new recognizer in place:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = analyzer.analyze(text=text1, language=\"en\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result 0:\n",
      " type: TITLE, start: 10, end: 19, score: 1.0, text: Professor\n",
      "Result 1:\n",
      " type: PERSON, start: 20, end: 24, score: 0.85, text: Plum\n",
      "Result 2:\n",
      " type: LOCATION, start: 29, end: 44, score: 0.85, text: the Dining Room\n"
     ]
    }
   ],
   "source": [
    "print_analyzer_results(results, text=text1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, both the title and the name \"Plum\" were identified as PII, along with the location detected by the NLP model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Identified these PII entities:\n",
      "- Professor as TITLE\n",
      "- Plum as PERSON\n",
      "- the Dining Room as LOCATION\n"
     ]
    }
   ],
   "source": [
    "print(\"Identified these PII entities:\")\n",
    "for result in results:\n",
    "    print(f\"- {text1[result.start:result.end]} as {result.entity_type}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 2: Regex based PII recognition\n",
    "Another simple recognizer we can add is based on regular expressions. \n",
    "Let's assume we want to be extremely conservative and treat any token which contains a number as PII."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the regex pattern in a Presidio `Pattern` object\n",
    "# (a raw string avoids an invalid escape sequence warning):\n",
    "numbers_pattern = Pattern(name=\"numbers_pattern\", regex=r\"\\d+\", score=0.5)\n",
    "\n",
    "# Define the recognizer with one or more patterns\n",
    "number_recognizer = PatternRecognizer(\n",
    "    supported_entity=\"NUMBER\", patterns=[numbers_pattern]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Testing the recognizer itself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result:\n",
      "[type: NUMBER, start: 10, end: 13, score: 0.5]\n"
     ]
    }
   ],
   "source": [
    "text2 = \"I live in 510 Broad st.\"\n",
    "\n",
    "numbers_result = number_recognizer.analyze(text=text2, entities=[\"NUMBER\"])\n",
    "print(\"Result:\")\n",
    "print(numbers_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to mention that recognizers are likely to produce errors, both false positives and false negatives, which would affect the overall performance of Presidio. Consider testing each recognizer on a representative dataset prior to integrating it into Presidio. For more info, see the [best practices for developing recognizers documentation](https://microsoft.github.io/presidio/analyzer/developing_recognizers/)."
   ]
  },
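  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of such a test, the sketch below measures span-level precision and recall of the `\\d+` pattern with the standard library only. The labeled `samples` and the `evaluate` helper are hypothetical and made up for this example; a real evaluation would use a representative dataset:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical labeled samples: (text, expected (start, end) spans of numbers)\n",
    "samples = [\n",
    "    (\"I live in 510 Broad st.\", [(10, 13)]),\n",
    "    (\"Call me at noon\", []),\n",
    "    (\"Room 12 on floor two\", [(5, 7), (17, 20)]),\n",
    "]\n",
    "\n",
    "pattern = re.compile(r\"\\d+\")\n",
    "\n",
    "\n",
    "def evaluate(pattern, samples):\n",
    "    \"\"\"Return span-level (precision, recall) for a compiled regex.\"\"\"\n",
    "    tp = fp = fn = 0\n",
    "    for text, expected in samples:\n",
    "        predicted = {m.span() for m in pattern.finditer(text)}\n",
    "        expected = set(expected)\n",
    "        tp += len(predicted & expected)  # correctly detected spans\n",
    "        fp += len(predicted - expected)  # detected but not labeled\n",
    "        fn += len(expected - predicted)  # labeled but missed\n",
    "    precision = tp / (tp + fp) if tp + fp else 0.0\n",
    "    recall = tp / (tp + fn) if tp + fn else 0.0\n",
    "    return precision, recall\n",
    "\n",
    "\n",
    "precision, recall = evaluate(pattern, samples)\n",
    "print(f\"precision={precision:.2f}, recall={recall:.2f}\")\n",
    "```\n",
    "\n",
    "In this toy dataset the regex misses the number written as a word (\"two\"), which lowers recall."
   ]
  },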
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 3: Rule based logic recognizer\n",
    "\n",
    "Taking the numbers recognizer one step further, let's say we also would like to detect numbers within words, e.g. \"Number One\". We can leverage the underlying spaCy token attributes, or write our own logic to detect such entities.\n",
    "\n",
    "Notes:\n",
    "\n",
    "- In this example we would create a new class, which implements [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py), the basic recognizer in Presidio. This abstract class requires us to implement the `load` method and `analyze` method. \n",
    "\n",
    "- Each recognizer accepts an object of type `NlpArtifacts`, which holds pre-computed attributes on the input text.\n",
    "\n",
    "A new recognizer should have this structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyRecognizer(EntityRecognizer):\n",
    "\n",
    "    def load(self) -> None:\n",
    "        \"\"\"No loading is required.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def analyze(\n",
    "        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts\n",
    "    ) -> List[RecognizerResult]:\n",
    "        \"\"\"\n",
    "        Logic for detecting a specific PII\n",
    "        \"\"\"\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, detecting numbers in either numerical or alphabetic (e.g. Forty five) form:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [],
   "source": [
    "class NumbersRecognizer(EntityRecognizer):\n",
    "\n",
    "    expected_confidence_level = 0.7  # expected confidence level for this recognizer\n",
    "\n",
    "    def load(self) -> None:\n",
    "        \"\"\"No loading is required.\"\"\"\n",
    "        pass\n",
    "\n",
    "    def analyze(\n",
    "        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts\n",
    "    ) -> List[RecognizerResult]:\n",
    "        \"\"\"\n",
    "        Analyzes text to find tokens which represent numbers (either 123 or One Two Three).\n",
    "        \"\"\"\n",
    "        results = []\n",
    "\n",
    "        # iterate over the spaCy tokens, and call `token.like_num`\n",
    "        for token in nlp_artifacts.tokens:\n",
    "            if token.like_num:\n",
    "                result = RecognizerResult(\n",
    "                    entity_type=\"NUMBER\",\n",
    "                    start=token.idx,\n",
    "                    end=token.idx + len(token),\n",
    "                    score=self.expected_confidence_level,\n",
    "                )\n",
    "                results.append(result)\n",
    "        return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_numbers_recognizer = NumbersRecognizer(supported_entities=[\"NUMBER\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this recognizer requires the `NlpArtifacts`, we would have to call it as part of the `AnalyzerEngine` flow:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result 0:\n",
      " type: PERSON, start: 0, end: 7, score: 0.85, text: Roberto\n",
      "Result 1:\n",
      " type: LOCATION, start: 25, end: 34, score: 0.85, text: Broad st.\n",
      "Result 2:\n",
      " type: NUMBER, start: 17, end: 21, score: 0.7, text: Five\n",
      "Result 3:\n",
      " type: NUMBER, start: 22, end: 24, score: 0.7, text: 10\n"
     ]
    }
   ],
   "source": [
    "text3 = \"Roberto lives in Five 10 Broad st.\"\n",
    "analyzer = AnalyzerEngine()\n",
    "analyzer.registry.add_recognizer(new_numbers_recognizer)\n",
    "\n",
    "numbers_results2 = analyzer.analyze(text=text3, language=\"en\")\n",
    "print_analyzer_results(numbers_results2, text=text3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The analyzer was able to pick up numbers in both numeric and alphabetic form, in addition to PII entities detected by other recognizers (PERSON and LOCATION in this case)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 4: Calling an external service for PII detection\n",
    "\n",
    "In a similar way to Example 3, we can write logic to call external services for PII detection.\n",
    "For a detailed example, see [this part of the documentation](https://microsoft.github.io/presidio/analyzer/adding_recognizers/#creating-a-remote-recognizer).\n",
    "\n",
    "[This is a sample implementation of such a remote recognizer](https://github.com/microsoft/presidio/blob/main/docs/samples/python/example_remote_recognizer.py).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 5: Supporting new languages\n",
    "\n",
    "Two main parts in Presidio handle the text, and should be adapted if a new language is required:\n",
    "1. The `NlpEngine` containing the NLP model which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.\n",
    "2. The different PII recognizers (`EntityRecognizer` objects) should be adapted or created."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "#### Adapting the NLP engine\n",
    "\n",
    "As its internal NLP engine, Presidio supports both spaCy and Stanza. Make sure you download the required spaCy or Stanza models prior to using them. See [configuring the NLP engine](https://microsoft.github.io/presidio/analyzer/languages/#configuring-the-nlp-engine) for more details. For example, to download the Spanish medium spaCy model: `python -m spacy download es_core_news_md`\n",
    "\n",
    "In this example we will configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Results from Spanish request:\n",
      "[type: PERSON, start: 13, end: 19, score: 0.85]\n",
      "Results from English request:\n",
      "[type: PERSON, start: 11, end: 17, score: 0.85]\n"
     ]
    }
   ],
   "source": [
    "from presidio_analyzer.nlp_engine import NlpEngineProvider\n",
    "\n",
    "# import spacy\n",
    "# spacy.cli.download(\"es_core_news_md\")\n",
    "\n",
    "# Create configuration containing engine name and models\n",
    "configuration = {\n",
    "    \"nlp_engine_name\": \"spacy\",\n",
    "    \"models\": [\n",
    "        {\"lang_code\": \"es\", \"model_name\": \"es_core_news_md\"},\n",
    "        {\"lang_code\": \"en\", \"model_name\": \"en_core_web_lg\"},\n",
    "    ],\n",
    "}\n",
    "\n",
    "# Create NLP engine based on configuration\n",
    "provider = NlpEngineProvider(nlp_configuration=configuration)\n",
    "nlp_engine_with_spanish = provider.create_engine()\n",
    "\n",
    "# Pass the created NLP engine and supported_languages to the AnalyzerEngine\n",
    "analyzer = AnalyzerEngine(\n",
    "    nlp_engine=nlp_engine_with_spanish, supported_languages=[\"en\", \"es\"]\n",
    ")\n",
    "\n",
    "# Analyze in different languages\n",
    "results_spanish = analyzer.analyze(text=\"Mi nombre es Morris\", language=\"es\")\n",
    "print(\"Results from Spanish request:\")\n",
    "print(results_spanish)\n",
    "\n",
    "results_english = analyzer.analyze(text=\"My name is Morris\", language=\"en\")\n",
    "print(\"Results from English request:\")\n",
    "print(results_english)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [See this documentation](https://microsoft.github.io/presidio/analyzer/languages/) for more details on how to configure Presidio to support additional NLP models and languages.\n",
    "- [See this sample](ner_model_configuration.ipynb) for more implementation examples of various NLP engines and NER models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 6: Using context words\n",
    "\n",
    "Presidio has an internal mechanism for leveraging context words. This mechanism increases the detection confidence of a PII entity when a specific word appears before or after it.\n",
    "\n",
    "In this example we first implement a zip code recognizer without context, and then add context to see how the confidence changes. Zip regex patterns (essentially 5 digits) are very weak, so we want the initial confidence to be low, and to increase only when context words are present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result 0:\n",
      " type: US_ZIP_CODE, start: 15, end: 20, score: 0.01, text: 90210\n"
     ]
    }
   ],
   "source": [
    "# Define the regex pattern\n",
    "regex = r\"(\\b\\d{5}(?:\\-\\d{4})?\\b)\"  # very weak regex pattern\n",
    "zipcode_pattern = Pattern(name=\"zip code (weak)\", regex=regex, score=0.01)\n",
    "\n",
    "# Define the recognizer with the defined pattern\n",
    "zipcode_recognizer = PatternRecognizer(\n",
    "    supported_entity=\"US_ZIP_CODE\", patterns=[zipcode_pattern]\n",
    ")\n",
    "\n",
    "registry = RecognizerRegistry()\n",
    "registry.add_recognizer(zipcode_recognizer)\n",
    "analyzer = AnalyzerEngine(registry=registry)\n",
    "\n",
    "# Test\n",
    "text = \"My zip code is 90210\"\n",
    "results = analyzer.analyze(text=text, language=\"en\")\n",
    "print_analyzer_results(results, text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works, but it would catch any 5-digit string, which is why we set the score to 0.01. Let's use context words to increase the score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the recognizer with the defined pattern and context words\n",
    "zipcode_recognizer = PatternRecognizer(\n",
    "    supported_entity=\"US_ZIP_CODE\",\n",
    "    patterns=[zipcode_pattern],\n",
    "    context=[\"zip\", \"zipcode\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When creating an ```AnalyzerEngine``` we can provide our own context enhancement logic by passing it to the ```context_aware_enhancer``` parameter.\n",
    "```AnalyzerEngine``` creates a ```LemmaContextAwareEnhancer``` by default if none is passed. It increases the score of each matched result if its recognizer holds context words and those words are found in the context of the matched entity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "registry = RecognizerRegistry()\n",
    "registry.add_recognizer(zipcode_recognizer)\n",
    "analyzer = AnalyzerEngine(registry=registry)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result:\n",
      "Result 0:\n",
      " type: US_ZIP_CODE, start: 15, end: 20, score: 0.4, text: 90210\n"
     ]
    }
   ],
   "source": [
    "# Test\n",
    "results = analyzer.analyze(text=\"My zip code is 90210\", language=\"en\")\n",
    "print(\"Result:\")\n",
    "print_analyzer_results(results, text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confidence score is now 0.4 instead of 0.01, because the default context similarity factor of ```LemmaContextAwareEnhancer``` is 0.35 and its default minimum score with context similarity is 0.4. We can change these defaults by passing other values for the ```context_similarity_factor``` and ```min_score_with_context_similarity``` parameters of ```LemmaContextAwareEnhancer```, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "registry = RecognizerRegistry()\n",
    "registry.add_recognizer(zipcode_recognizer)\n",
    "analyzer = AnalyzerEngine(\n",
    "    registry=registry,\n",
    "    context_aware_enhancer=LemmaContextAwareEnhancer(\n",
    "        context_similarity_factor=0.45, min_score_with_context_similarity=0.4\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result:\n",
      "Result 0:\n",
      " type: US_ZIP_CODE, start: 15, end: 20, score: 0.46, text: 90210\n"
     ]
    }
   ],
   "source": [
    "# Test\n",
    "results = analyzer.analyze(text=\"My zip code is 90210\", language=\"en\")\n",
    "print(\"Result:\")\n",
    "print_analyzer_results(results, text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confidence score is now 0.46: the original score of 0.01 was increased by the context similarity factor of 0.45, and the result is higher than the minimum of 0.4."
   ]
  },
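  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic behind the two scores above (0.4 and 0.46) can be sketched in plain Python. This is a simplified illustration that reproduces the numbers reported in this notebook, not Presidio's implementation. The helper name `enhanced_score` is hypothetical, and capping the score at 1.0 is an assumption:\n",
    "\n",
    "```python\n",
    "def enhanced_score(original, similarity_factor=0.35, min_score=0.4):\n",
    "    \"\"\"Hypothetical helper: boost a score once a context word was found.\"\"\"\n",
    "    boosted = original + similarity_factor  # add the context similarity factor\n",
    "    boosted = max(boosted, min_score)  # apply the minimum score with context\n",
    "    boosted = min(boosted, 1.0)  # assumption: scores are capped at 1.0\n",
    "    return round(boosted, 2)\n",
    "\n",
    "\n",
    "default_score = enhanced_score(0.01)\n",
    "custom_score = enhanced_score(0.01, similarity_factor=0.45)\n",
    "print(default_score, custom_score)\n",
    "```\n",
    "\n",
    "With the defaults, 0.01 + 0.35 stays below the minimum, so the minimum of 0.4 is used. With a factor of 0.45, the sum exceeds the minimum and is kept as is."
   ]
  },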
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Presidio also supports passing a list of outer context words at the analyzer level. This is useful when the text comes from a specific column or a specific user input, for example. Notice how the \"zip\" context word doesn't appear in the text but still increases the confidence score from 0.01 to 0.4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result:\n",
      "Result 0:\n",
      " type: US_ZIP_CODE, start: 11, end: 16, score: 0.4, text: 90210\n"
     ]
    }
   ],
   "source": [
    "# Define the recognizer with the defined pattern and context words\n",
    "zipcode_recognizer = PatternRecognizer(\n",
    "    supported_entity=\"US_ZIP_CODE\",\n",
    "    patterns=[zipcode_pattern],\n",
    "    context=[\"zip\", \"zipcode\"],\n",
    ")\n",
    "\n",
    "registry = RecognizerRegistry()\n",
    "registry.add_recognizer(zipcode_recognizer)\n",
    "analyzer = AnalyzerEngine(registry=registry)\n",
    "\n",
    "# Test\n",
    "text = \"My code is 90210\"\n",
    "result = analyzer.analyze(text=text, language=\"en\", context=[\"zip\"])\n",
    "print(\"Result:\")\n",
    "print_analyzer_results(result, text=text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Example 7: Tracing the decision process\n",
    "\n",
    "Presidio-analyzer's decision process exposes information on why a specific PII entity was detected. Such information could contain:\n",
    "\n",
    "- Which recognizer detected the entity\n",
    "- Which regex pattern was used\n",
    "- Interpretability mechanisms in ML models\n",
    "- Which context words improved the score\n",
    "- Confidence scores before and after each step\n",
    "\n",
    "And more.\n",
    "\n",
    "For more information, refer to the [decision process documentation](https://microsoft.github.io/presidio/analyzer/decision_process/).\n",
    "\n",
    "Let's use the decision process output to understand how the zip code value was detected:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Decision process output:\n",
      "\n",
      "{'original_score': 0.01,\n",
      " 'pattern': '(\\\\b\\\\d{5}(?:\\\\-\\\\d{4})?\\\\b)',\n",
      " 'pattern_name': 'zip code (weak)',\n",
      " 'recognizer': 'PatternRecognizer',\n",
      " 'regex_flags': regex.I|regex.M|regex.S,\n",
      " 'score': 0.4,\n",
      " 'score_context_improvement': 0.39,\n",
      " 'supportive_context_word': 'zip',\n",
      " 'textual_explanation': 'Detected by `PatternRecognizer` using pattern `zip '\n",
      "                        'code (weak)`',\n",
      " 'validation_result': None}\n"
     ]
    }
   ],
   "source": [
    "results = analyzer.analyze(\n",
    "    text=\"My zip code is 90210\", language=\"en\", return_decision_process=True\n",
    ")\n",
    "decision_process = results[0].analysis_explanation\n",
    "\n",
    "pp = pprint.PrettyPrinter()\n",
    "print(\"Decision process output:\\n\")\n",
    "pp.pprint(decision_process.__dict__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When developing new recognizers, one can add information to this explanation and extend it with additional findings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 8: Passing a list of words to keep"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the built-in recognizers, which include the `UrlRecognizer` and the recognizer based on the NLP model, and first look at the default behavior when we don't specify any list of words to keep in the text.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result 0:\n",
      " type: PERSON, start: 0, end: 4, score: 0.85, text: Bill\n",
      " Identified as PERSON by Spacy's Named Entity Recognition\n",
      "Result 1:\n",
      " type: URL, start: 27, end: 35, score: 0.85, text: bing.com\n",
      " Detected by `UrlRecognizer` using pattern `Non schema URL`\n",
      "Result 2:\n",
      " type: PERSON, start: 37, end: 42, score: 0.85, text: David\n",
      " Identified as PERSON by Spacy's Named Entity Recognition\n",
      "Result 3:\n",
      " type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com\n",
      " Detected by `UrlRecognizer` using pattern `Non schema URL`\n"
     ]
    }
   ],
   "source": [
    "websites_list = [\"bing.com\", \"microsoft.com\"]\n",
    "text1 = \"Bill's favorite website is bing.com, David's is microsoft.com\"\n",
    "analyzer = AnalyzerEngine()\n",
    "results = analyzer.analyze(text=text1, language=\"en\", return_decision_process=True)\n",
    "print_analyzer_results(results, text=text1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To specify an allow list, we pass the values we want to keep as the `allow_list` parameter in the call to `analyze`. In the results, `bing.com` is no longer recognized as a PII item. `microsoft.com` and the named entities are still recognized, since we did not include them in the allow list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result 0:\n",
      " type: PERSON, start: 0, end: 4, score: 0.85, text: Bill\n",
      " Identified as PERSON by Spacy's Named Entity Recognition\n",
      "Result 1:\n",
      " type: PERSON, start: 37, end: 42, score: 0.85, text: David\n",
      " Identified as PERSON by Spacy's Named Entity Recognition\n",
      "Result 2:\n",
      " type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com\n",
      " Detected by `UrlRecognizer` using pattern `Non schema URL`\n"
     ]
    }
   ],
   "source": [
    "results = analyzer.analyze(\n",
    "    text=text1,\n",
    "    language=\"en\",\n",
    "    allow_list=[\"bing.com\", \"google.com\"],\n",
    "    return_decision_process=True,\n",
    ")\n",
    "print_analyzer_results(results, text=text1)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "e69ef4511a97f25b97bbf02af36c33969d28d43609ab16fe063a45fcd77225b1"
  },
  "jupytext": {
   "cell_metadata_filter": "-all",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "presidio_e2e",
   "language": "python",
   "name": "presidio"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
